Abstract: Many networks including social networks, computer 
networks, and biological networks are found to divide naturally 
into communities of densely connected individuals. Finding
community structure is one of the fundamental problems in network
science. Since Newman's suggestion of using modularity as a
measure to quantify the goodness of community structures, many
efficient methods to maximize modularity have been proposed,
but without a guarantee of optimality. In this paper, we propose
two polynomial-time algorithms for the modularity maximization
problem with theoretical performance guarantees. The first
algorithm comes with an a priori guarantee that the modularity
of the found community structure is within a constant factor of
the optimal modularity when the network has a power-law
degree distribution. Despite being mainly of theoretical interest,
to the best of our knowledge, this is the first approximation
algorithm for finding community structure in networks. In our second
algorithm, we propose the sparse metric, a substantially faster
linear programming method for maximizing modularity, and
apply a rounding technique based on this sparse metric with
an a posteriori approximation guarantee. Our experiments show that
the rounding algorithm returns optimal solutions in most
cases and is very scalable: it can run on networks of
a few thousand nodes, whereas the LP solution in the literature
only ran on networks of at most 235 nodes.

I. Introduction 

Many complex systems of interest, such as the Internet and
social and biological relations, can be represented as networks
consisting of a set of nodes connected by edges. Research in a
number of academic fields has uncovered unexpected structural
properties of complex networks, including the small-world
phenomenon [1], power-law degree distribution [2], and the
existence of community structure [3], where nodes are naturally
clustered into tightly connected modules, also known as
communities, with only sparser connections between them.

The detection of community structures in networks is an
important problem that has drawn an enormous amount of
research effort [4]. A major benefit of identifying community
structure is that one can infer semantic attributes for different
communities. For example, in social networks the attribute
of a community can be a common interest or location, and in
metabolic networks the attribute could be a common function.
Moreover, the relative independence among different communities
allows each community to be examined individually,
and the network to be analyzed at a higher level of structure.

There is a wide variety of definitions for communities. In
general, definitions can be classified into two main categories:
local definitions and global definitions. In local definitions,
only the group of nodes and its immediate neighborhood are
considered, ignoring the rest of the network. For example,
communities can be defined as maximal cliques, quasi-cliques,
or k-plexes. The most famous definitions in this category are
the notions of strong community, where each node has more
neighbors inside than outside the community, and weak community,
where the total number of inner edges must be at least half of
the number of outgoing edges.

In global definitions, communities can only be recognized
by analyzing the network as a whole. This type of definition
is especially suitable when the next phase after
community detection is to optimize a global quantity, for
example, minimizing the inter-group communication cost. The
most widely used quality function in the global category is
Newman's modularity, which is defined as the number of edges
falling within communities minus the expected number in
an equivalent network with edges placed at random [5]. The
higher the modularity, the better the community structure.
Thus, identifying a good community structure of a given
network becomes finding a partition of the network that
maximizes the modularity of this partition, which is called the
modularity maximization problem.

Since the introduction of modularity, maximizing modularity
has become a primary approach to detecting community
structure. Numerous computational methods have been proposed,
based on agglomerative hierarchical clustering [6], simulated
annealing [7], genetic search [8], extremal optimization [9],
spectral clustering [10], multilevel partitioning [11], and many
others. For a comprehensive view of community detection
methods, we refer to the excellent survey of S. Fortunato and
C. Castellano [4].

Unfortunately, Brandes et al. [12] have shown that modularity
maximization is an NP-hard problem, which rules out
polynomial-time algorithms for finding optimal
solutions unless P = NP. Thus, it is desirable to design polynomial-time
approximation algorithms that find a partitioning with a theoretical
performance guarantee on the modularity value.

In contrast to the vast amount of work on maximizing
modularity, the only known polynomial-time approach to find
a good community structure with guarantees is due to G. Agarwal
and D. Kempe [13], who rounded the fractional
solution of a linear program (LP). The value obtained
by the LP is an upper bound on the maximum achievable
modularity. Thus, their approach provides an a posteriori guarantee
on the error bound. In fact, the modularity values found
by their approach are optimal for many network instances
when compared with the optimal modularity values provided by
the expensive exact algorithms in [14]. The main drawback of
the approach is the large LP formulation, which consumes both
time and memory resources. As shown in their paper, the
approach can only be used on networks of up to 235 nodes.
Secondly, while the approach performs well on all considered
networks, it does not promise any a priori guarantee of the kind
provided by approximation algorithms.

In this paper, we address the main drawback of the rounding
LP approach by introducing an improved formulation, called
the sparse metric. We show that our new technique substantially
reduces the time and memory requirements, both theoretically
and experimentally, without any trade-off in the quality of
the solution. The size of solved network instances rises from
a few hundred to several thousand nodes, while the running times on
medium-sized instances are sped up by factors of 10 to 150.

Our second contribution is an approximation algorithm that
finds a community structure with modularity
within a constant factor of the optimum when the
considered networks have power-law degree distributions. To
the best of our knowledge, it is the first approximation algorithm
for finding community structure in networks. The algorithm is
not only of theoretical interest, but also establishes a connection
between the power-law degree distribution property and the
presence of community structure in complex networks. Since
community structure is often observed together with the
power-law property, studying community structure detection
under power-law network models is of great importance.

Organization. We present definitions and notions in Section
II. We propose in Section III the sparse metric technique to
efficiently maximize modularity via rounding a linear program.
An approximation algorithm for networks with
power-law degree distribution (so-called power-law networks)
is introduced in Section IV. We show experimental results
for the sparse metric in Section V to illustrate the time
efficiency over the previous approach. Finally, in Section
VI, we summarize our results and discuss the limitation of
modularity as well as the corresponding resolution.

II. Preliminaries 

A network can be represented as an undirected graph $G = (V, E)$
consisting of $n = |V|$ nodes and $m = |E|$ edges.
The adjacency matrix of $G$ is denoted by $A = (A_{ij})$, where
$A_{ij} = A_{ji} = 1$ if $i$ and $j$ share an edge and
$A_{ij} = A_{ji} = 0$ otherwise.

A modularity maximization problem asks us to identify a
community structure $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ of a given graph,
where the disjoint subsets $C_i$ are called communities and
$\bigcup_{i=1}^{k} C_i = V$, so as to maximize the modularity of $\mathcal{C}$. Note
that $k$ is not a pre-defined value. The modularity [10] of $\mathcal{C}$ is
the fraction of the edges that fall within the given communities
minus the expected value of this fraction if edges were
distributed at random. The randomization of the edges is done
so as to preserve the degree of each vertex. If nodes $i$ and $j$
have degrees $d_i$ and $d_j$, then the expected number of edges
falling between $i$ and $j$ is $\frac{d_i d_j}{2m}$. Thus, the modularity, denoted
by $Q$, is

$$Q(\mathcal{C}) = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta_{ij}, \tag{1}$$

where $\delta_{ij} = 1$ if $i$ and $j$ are in the same community and
$\delta_{ij} = 0$ otherwise.
We also define the modularity matrix $B$ [10] as

$$B_{ij} = A_{ij} - \frac{d_i d_j}{2m}.$$
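As an illustration of definition (1), the following minimal sketch evaluates the modularity of a given partition directly from the double sum. The function name `modularity`, the edge-list input format, and the toy graph (two triangles joined by one edge) are assumptions made for this example only.

```python
from itertools import product


def modularity(n, edges, community):
    """Q = (1/2m) * sum_{i,j} (A_ij - d_i d_j / 2m) * delta_ij, as in Eq. (1)."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    deg = [sum(row) for row in adj]
    two_m = sum(deg)  # sum of degrees equals 2m in an undirected simple graph
    total = 0.0
    for i, j in product(range(n), repeat=2):
        if community[i] == community[j]:  # delta_ij = 1
            total += adj[i][j] - deg[i] * deg[j] / two_m  # entry B_ij
    return total / two_m


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(6, edges, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(6, edges, [0, 0, 0, 0, 0, 0])  # everything in one community
print(q_split, q_single)
```

Splitting the toy graph into its two triangles gives $Q = 5/14$, while placing all nodes in a single community gives $Q = 0$, because all entries of $B$ sum to zero.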



We note that each row and column of $B$ sums to zero; hence,
$B$ always has the vector $(1, 1, \ldots, 1)$ as one of its eigenvectors.
The same property is also known for the network Laplacian
matrix $L = D - A$, where $D$ is the diagonal matrix whose $i$th
diagonal entry is $d_i$. The Laplacian matrix $L$ is widely used in spectral
methods for graph partitioning, which is closely related to
our community detection problem. We note that the major
difference between the modularity matrix and the Laplacian
matrix is that $L$ is positive semidefinite while $B$ is indefinite. As
a consequence, while approximation algorithms for the graph
partitioning problem using the Laplacian matrix $L$ are available, it
is not known if such algorithms are possible for the modularity
maximization problem.

III. Linear Programming Based Algorithm 

A. The Linear Program and The Rounding 

The modularity maximization problem can be formulated as
an Integer Linear Program (ILP). The program has
one variable $d_{ij}$ for each pair of vertices $(i, j)$ to represent
the "distance" between $i$ and $j$, i.e.

$$d_{ij} = \begin{cases} 0 & \text{if } i \text{ and } j \text{ are in the same community,} \\ 1 & \text{otherwise.} \end{cases}$$

In other words, $d_{ij}$ is equivalent to $1 - \delta_{ij}$ in the definition (1)
of modularity. Thus, the objective function to be maximized
can be written as $\frac{1}{2m} \sum_{i,j} B_{ij} (1 - d_{ij})$. We note that there should
be no confusion between $d_{ij}$, the variable representing the
distance between vertices $i$ and $j$, and the constant $d_i$ (or $d_j$),
the degree of node $i$ (or $j$). The ILP to maximize modularity
(IP_complete) is as follows:



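$$\begin{aligned}
\text{maximize} \quad & \frac{1}{2m} \sum_{i,j} B_{ij} (1 - d_{ij}) && (2) \\
\text{subject to} \quad & d_{ij} + d_{jk} - d_{ik} \ge 0, \quad \forall\, i < j < k && (3) \\
& d_{ij} - d_{jk} + d_{ik} \ge 0, \quad \forall\, i < j < k && (4) \\
& -d_{ij} + d_{jk} + d_{ik} \ge 0, \quad \forall\, i < j < k && (5) \\
& d_{ij} \in \{0, 1\}, \quad i, j \in [1..n] && (6)
\end{aligned}$$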

Constraints (3), (4), and (5) are the well-known triangle
inequalities, which guarantee that the values of $d_{ij}$ are consistent with each
other. They imply the following transitivity: if $i$ and $j$ are in
the same community and $j$ and $k$ are in the same community,
then so are $i$ and $k$. By definition, $d_{ii} = 0$ for all $i$, so these
variables can be removed from the ILP for simplification.

To avoid solving the ILP, which is also NP-hard, we instead
solve the LP relaxation of the ILP, obtained by replacing the
constraints $d_{ij} \in \{0, 1\}$ by $d_{ij} \in [0, 1]$. We shall refer to
the IP described above as IP_complete and its relaxation as
LP_complete. If the optimal solution of this relaxation is an
integral solution, which is very often the case [14], we have a
partition with the maximum modularity. Otherwise, we resort
to rounding the fractional solution and use the value of the
LP objective as an upper bound, which lets us bound
the gap between the rounded solution and the optimal integral
solution.

G. Agarwal and D. Kempe [13] use a simple rounding
algorithm proposed by Charikar et al. [15] for the correlation
clustering problem [16]. The values of $d_{ij}$ are interpreted as a
metric "distance" between vertices. The algorithm repeatedly
groups all vertices that are close to a vertex into a
community. The final community structure is then refined
by a Kernighan-Lin [17] based local search method.
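The grouping idea can be sketched as follows. This is a simplified illustration only, not the exact procedure of Charikar et al. [15] or of [13]: the fixed `radius` of 0.5, the order in which centers are chosen, and the function name are all assumptions, and the Kernighan-Lin refinement is omitted.

```python
def round_by_threshold(d, radius=0.5):
    """Greedy rounding sketch: pick an unassigned vertex as a center and put
    every unassigned vertex within `radius` of it into the same community."""
    n = len(d)
    community = [None] * n
    next_id = 0
    for center in range(n):
        if community[center] is not None:
            continue
        for v in range(n):
            # d[center][center] = 0, so the center always joins its own group.
            if community[v] is None and d[center][v] <= radius:
                community[v] = next_id
        next_id += 1
    return community


# Hypothetical fractional LP solution on three vertices.
d_frac = [[0.0, 0.2, 0.9],
          [0.2, 0.0, 0.8],
          [0.9, 0.8, 0.0]]
labels = round_by_threshold(d_frac)
print(labels)
```

When the LP solution is already integral, any radius in $[0, 1)$ recovers exactly the partition encoded by the distances.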

Since the rounding phase is comparatively simple, the
burden of both time and memory comes from solving the large
LP relaxation. The LP has $\binom{n}{2}$ variables and $3\binom{n}{3} = \Theta(n^3)$
constraints, which is about half a million constraints for a network
of 100 vertices, thereby limiting the size of networks to
a few hundred nodes. Thus, there is a need to achieve the same
guarantees with smaller resource requirements. By combining
a mathematical approach with combinatorial techniques, we
achieve this goal in the next subsection.
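The size of the complete formulation follows directly from these two binomial counts. The short sketch below (the helper name `complete_lp_size` is an assumption) evaluates them for a few values of $n$.

```python
from math import comb


def complete_lp_size(n):
    """Number of variables and triangle constraints in the complete LP."""
    variables = comb(n, 2)        # one d_ij per unordered pair
    constraints = 3 * comb(n, 3)  # three triangle inequalities per triple
    return variables, constraints


sizes = {n: complete_lp_size(n) for n in (34, 62, 100)}
for n, (num_vars, num_cons) in sizes.items():
    print(n, num_vars, num_cons)
```

For $n = 100$ this gives 4,950 variables and 485,100 constraints, i.e. the "about half a million" quoted above.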



B. The Sparse Metric 

In this subsection, we devise an improved LP formulation
for the modularity maximization problem with far fewer
constraints while keeping the same guarantees on
the performance.

Instead of using $3\binom{n}{3}$ triangle inequalities to ensure that $d_{ij}$
is a metric (or pseudo-metric as defined later), we show that
only a compact subset of inequalities, the so-called sparse metric,
is sufficient to obtain the same fractional optimal solution.

A function $d$ is a pseudo-metric if $d(i,j) = d_{ij}$ satisfies the
following conditions:

1) $d(i,j) \ge 0$ (non-negativity)

2) $d(i,i) = 0$ (and possibly $d(i,j) = 0$ for some
distinct values $i \ne j$)

3) $d(i,j) = d(j,i)$ (symmetry)

4) $d(i,j) \le d(i,k) + d(k,j)$ (triangle inequality).

It is clear that $d$ is a feasible solution of LP_complete if and
only if $d$ is a pseudo-metric with values in the interval [0, 1].
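The four conditions can be checked mechanically on a full distance matrix. The sketch below is illustrative only; the function name, the tolerance parameter, and the two small matrices are assumptions.

```python
def is_pseudo_metric(d, tol=1e-12):
    """Check conditions 1)-4) on a square matrix d given as a list of lists."""
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:                      # condition 2
            return False
        for j in range(n):
            if d[i][j] < -tol:                      # condition 1
                return False
            if abs(d[i][j] - d[j][i]) > tol:        # condition 3
                return False
            for k in range(n):
                if d[i][j] > d[i][k] + d[k][j] + tol:  # condition 4
                    return False
    return True


# Distances induced by a partition (0 inside a community, 1 across).
labels = [0, 0, 1]
good = [[0 if a == b else 1 for b in labels] for a in labels]
# d01 = d12 = 0 but d02 = 1 violates the triangle inequality.
bad = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
print(is_pseudo_metric(good), is_pseudo_metric(bad))
```

The second matrix is exactly the kind of inconsistent assignment that the triangle inequalities rule out: 0 and 1 together, 1 and 2 together, but 0 and 2 apart.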

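Our new integer linear program with the Sparse Metric technique,
denoted by IP_sparse, is as follows:

$$\begin{aligned}
\text{maximize} \quad & -\frac{1}{2m} \sum_{i,j} B_{ij} d_{ij} && (7) \\
\text{subject to} \quad & d_{ik} + d_{kj} \ge d_{ij}, \quad \forall\, i < j,\ k \in N(i,j) && (8) \\
& d_{ij} \in \{0, 1\} && (9)
\end{aligned}$$

The objective can be simplified from (2) to $-\frac{1}{2m} \sum_{i,j} B_{ij} d_{ij}$ since
$\sum_{i,j} B_{ij} = 0$. Let $N(i)$ and $N(j)$ denote the sets of neighbors
of $i$ and $j$, respectively. The set $N(i,j)$ is defined as the union
of the neighbors of $i$ and $j$:

$$N(i,j) = N(i) \cup N(j) - \{i, j\}.$$

Therefore, the total number of constraints in the formulation is
upper bounded by

$$\sum_{i<j} (d_i + d_j) = (n - 1) \sum_{i=1}^{n} d_i = O(mn).$$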

When the considered network is sparse, which is often true for
complex networks, our new formulation substantially reduces
time and memory requirements. For most real-world network
instances, where $n \approx m$, the number of constraints is
effectively reduced from $\Theta(n^3)$ to $O(n^2)$. If we consider the time
to solve a linear program to be cubic in the number of
constraints, the total time complexity for sparse networks
improves to $O(n^6)$ instead of $O(n^9)$ as in the original approach.
In practice, LPs can be solved quite efficiently. We mention the
increase of the size of the largest solved instance of the traveling
salesman problem from 49 cities in 1954 [18] to 85,900
cities in 2009 [19] as an example of the rapid development of
mathematical programming solvers and computing power.
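To make the reduction concrete, the following sketch enumerates the sparse-metric constraints $d_{ik} + d_{kj} \ge d_{ij}$, $k \in N(i,j)$, for a given edge list and compares their number with the $3\binom{n}{3}$ constraints of the complete formulation. The function name and the path graphs used as input are assumptions chosen for illustration.

```python
from itertools import combinations
from math import comb


def sparse_constraints(n, edges):
    """List the triples (i, j, k) encoding d_ik + d_kj >= d_ij, k in N(i, j)."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    constraints = []
    for i, j in combinations(range(n), 2):
        # N(i, j) = N(i) union N(j), minus i and j themselves
        for k in sorted((nbrs[i] | nbrs[j]) - {i, j}):
            constraints.append((i, j, k))
    return constraints


path4 = [(0, 1), (1, 2), (2, 3)]
path50 = [(i, i + 1) for i in range(49)]
count4 = len(sparse_constraints(4, path4))
count50 = len(sparse_constraints(50, path50))
print(count4, 3 * comb(4, 3))
print(count50, 3 * comb(50, 3))
```

On a 50-node path every $N(i,j)$ has at most four elements, so the sparse formulation has at most $4\binom{50}{2} = 4900$ constraints, against 58,800 in the complete one.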

Again, we can obtain the relaxation of IP_sparse, described
in (7) to (9), by replacing the constraints $d_{ij} \in \{0, 1\}$ by
$d_{ij} \in [0, 1]$. We shall refer to this relaxation of IP_sparse as
LP_sparse. The fractional optimal solution of this relaxation can
also be rounded and tuned with the same algorithms as in the
previous subsection.

C. Correctness and Performance Guarantees 

In order to achieve the same guarantees provided by solving
LP_complete, we show the equivalence of the sparse formulation
and the complete formulation:

• IP_sparse and IP_complete share the same set of optimal
integral solutions (Theorem 1).

• The optimal fractional solutions of LP_sparse and
LP_complete have the same objective values (Theorem 2), i.e.
they provide the same upper bound on the maximum
possible modularity.

Hence, solving LP_sparse indeed gives us an optimal solution
of LP_complete, while doing so significantly reduces the time
and memory requirements.

Theorem 1: The two integer programs IP_sparse and
IP_complete share the same set of optimal solutions.




Fig. 1: Clique expanding process. 



Proof: We need to show that every optimal solution of
IP_complete is also a solution of IP_sparse and vice versa.

In one direction, since the constraints in IP_sparse are a
subset of the constraints in IP_complete, every optimal solution of
IP_complete is also a solution of IP_sparse.

In the other direction, let $d$ be an optimal integral solution
of IP_sparse. We shall prove that $d$ must be a pseudo-metric,
which implies that $d$ is also a feasible solution of IP_complete.

For convenience, we assume that the original graph $G =
(V, E)$ has no isolated vertices, which are known to have no
effect on modularity maximization [10]. Construct a graph
$G_d = (V, E_d)$ in which there is an edge $(i, j)$ for every
$d_{ij} = 0$. Let $C_d = \{C_d^1, C_d^2, \ldots, C_d^l\}$ be the set of connected
components in $G_d$, where $C_d^t$ represents the set of vertices in
the $t$th connected component.

Proposition 1: Every connected component $C_d^t$ induces a
connected subgraph in $G = (V, E)$.

Proof: We prove by contradiction. Assume that the
connected component $C_d^t$ does not induce a connected subgraph
in $G$. Hence, we can partition $C_d^t$ into two subsets $S$ and $T$
so that there are no edges between $S$ and $T$ in $G$.

Construct a new solution $d'$ from $d$ by setting $d'_{ij} = 1$ for all
pairs $(i, j) \in P(S, T)$, the set of pairs with one endpoint in $S$
and one endpoint in $T$. Since $A_{ij} = 0$ for all $(i,j) \in P(S,T)$, we
have $B_{ij} = A_{ij} - \frac{d_i d_j}{2m} < 0$ for all $(i, j) \in P(S,T)$. Hence, setting
$d'_{ij} = 1$ for all $(i,j) \in P(S,T)$ can only increase the objective
value. In fact, doing so will strictly increase the objective:
there must be at least one pair $(i, j) \in P(S, T)$ with $d_{ij} = 0$,
or else $C_d^t$ is not a connected component in $G_d$.

It is not hard to verify that $d'$ satisfies all constraints
of IP_sparse, since those triangle inequalities must involve at
least one edge in the original graph $G$, while $S$ and $T$ are
disconnected sets in $G$.

Thus, we have derived from an optimal solution a new
feasible solution with a higher objective (contradiction). ■

It remains to prove that for each connected component $C_d^t$
of $G_d$, if $i, j \in C_d^t$ then the distance $d_{ij} = 0$. We prove this
by repeatedly applying a "clique expanding" process. At each
step, every pair of vertices in the clique is proven to have
distance 0. Then, we expand the clique, adding one more
adjacent vertex to the clique, and prove that the new clique
also has vertices of distance zero from each other (see Fig. 1).

Initial step. We first prove there is an edge $(i, j) \in E$ of the
original graph $G$ satisfying $d_{ij} = 0$. We shall choose that edge
as our initial clique of size 2. Assume no such edge exists; then all
pairs with $d_{ij} = 0$ within $C_d^t$ have $A_{ij} = 0$ and $B_{ij} < 0$. Thus,
again we can increase the distance of all pairs with $d_{ij} = 0$
to 1 without violating any constraints, while increasing the
objective value (contradiction). Therefore, we can always find
an edge that belongs to both $G$ and $G_d$.

Expanding steps. Denote our clique by $K_t$. If $K_t = C_d^t$,
then the proof for $C_d^t$ is complete. Otherwise, there is
a vertex $u \in K_t$ and a vertex $v \in C_d^t - K_t$ so that $(u, v)$
is an edge in both $G$ and $G_d$ ($d_{u,v} = 0$). The existence of
such an edge $(u, v)$ can be proven by contradiction (assume
not, then increase the distance of all pairs in $P(K_t, C_d^t - K_t)$
from 0 to 1 to increase the objective value while not violating
any constraints). Then, for each vertex $w \in K_t - \{u\}$, the
constraint $d_{w,u} + d_{u,v} \ge d_{w,v}$ is in IP_sparse and $d_{w,u} = 0$ from
the property of $K_t$. It follows that $d_{w,v} = 0$ for all $w \in K_t$.
By adding $v$ to $K_t$ we increase the size of the clique, while
ensuring the zero-distance property.

Since the size of $C_d^t$ is at most $n$, the expanding process
will finally terminate with $K_t = C_d^t$. ■

Theorem 2: LP_sparse and LP_complete share the same set of
fractional optimal solutions.

Proof: We need to show that every fractional optimal
solution of LP_complete is also a fractional solution of LP_sparse
and vice versa. Since the integrality constraints have been
dropped in both LP relaxations, we need a different approach
from the proof of Theorem 1.

One direction is easy: every fractional optimal solution of
LP_complete is also a fractional solution of LP_sparse.

For the other direction, let $d$ be a fractional optimal
solution of LP_sparse; we shall prove that $d$ is also a feasible
solution of LP_complete.

Associate a weight $w_{ij} = d_{ij}$ with each edge $(i, j) \in E$
(other pairs are assigned weight $\infty$). Let $d'_{ij}$ be the shortest-path
distance between two nodes $i$ and $j$ under these edge weights. We have

1) $d'_{ij} \ge d_{ij}$ for all $i, j$, and $d'_{ij} = d_{ij}$ for all $(i, j) \in E$.

2) $d'_{ij} = \min_k \{ d'_{ik} + d'_{kj} \}$. Hence, $d'$ is a pseudo-metric.

The first statement can be shown by applying the triangle
inequalities in LP_sparse. Since $d'_{ij}$ is the shortest distance
between $i$ and $j$ in $G$, there is a path $u_0 = i, u_1, \ldots, u_l = j$
with length $d'_{ij} = d_{u_0,u_1} + d_{u_1,u_2} + \ldots + d_{u_{l-1},u_l}$. Since
$(u_{k-1}, u_k)$ are edges in $G$ for all $k = 1, \ldots, l$, we can apply
the triangle inequalities iteratively:

$$d_{ij} \le d_{u_0,u_1} + d_{u_1,u_l} \le d_{u_0,u_1} + d_{u_1,u_2} + d_{u_2,u_l}
\le \ldots \le d_{u_0,u_1} + d_{u_1,u_2} + \ldots + d_{u_{l-1},u_l} = d'_{ij}. \tag{10}$$

If $(i, j) \in E$, we have $d'_{ij} \le d_{ij}$. Hence, $d'_{ij} = d_{ij}$ for all $(i, j) \in
E$. The second statement comes from the definition of $d'$.

Notice that $d'_{ij}$ may no longer be upper bounded by one.
Therefore, we define $d^*_{ij} = \min\{d'_{ij}, 1\}$. We have

$$d^*_{ij} \ge d_{ij} \quad \text{and} \quad d^*_{ij} = d_{ij} \ \ \forall (i, j) \in E.$$

More importantly, $d^*$ is also a pseudo-metric, since

$$d^*_{ik} + d^*_{kj} \ge \min\{d'_{ik} + d'_{kj}, 1\} \ge \min\{d'_{ij}, 1\} = d^*_{ij}.$$

Now, if $d_{ij} = d^*_{ij}$ for all $i, j$, then $d$ satisfies all triangle
inequalities in LP_complete and the proof is complete.

Otherwise, assume that $d_{ij} < d^*_{ij}$ for some pair $(i, j)$. We
show that $d^*$ is a feasible solution of LP_sparse with a greater
objective value, which contradicts the hypothesis that $d$ is an
optimal solution.

Since $d_{ij} = d^*_{ij}$ for all edges $(i, j) \in E$, and for pairs
$(i, j) \notin E$ we have $B_{ij} < 0$ and $d^*_{ij} \ge d_{ij}$, it follows that
$-\sum_{i,j} B_{ij} d^*_{ij} > -\sum_{i,j} B_{ij} d_{ij}$ (contradiction). ■
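The construction of $d'$ and $d^*$ in the proof can be mirrored by a short computation. The sketch below is an illustration under stated assumptions: it uses the Floyd-Warshall all-pairs shortest-path recurrence (our choice, the proof does not prescribe an algorithm), a hypothetical 4-node path with made-up edge distances, and an assumed function name.

```python
import math


def capped_shortest_path_metric(n, edge_dist):
    """Return d* = min(d', 1), where d' is the shortest-path distance
    induced by the edge weights w_ij = d_ij (non-edges have weight inf)."""
    dp = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), w in edge_dist.items():
        dp[i][j] = dp[j][i] = min(dp[i][j], w)
    for k in range(n):  # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if dp[i][k] + dp[k][j] < dp[i][j]:
                    dp[i][j] = dp[i][k] + dp[k][j]
    return [[min(x, 1.0) for x in row] for row in dp]


# Hypothetical fractional distances on the edges of the path 0-1-2-3.
edge_dist = {(0, 1): 0.3, (1, 2): 0.4, (2, 3): 0.5}
d_star = capped_shortest_path_metric(4, edge_dist)
print(d_star[0][2], d_star[0][3])
```

Here $d'_{0,3} = 1.2$ is capped to $d^*_{0,3} = 1$, edge distances are unchanged, and the resulting matrix satisfies every triangle inequality, as the proof requires.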

IV. Approximation Algorithms for Maximizing 
Modularity in Power-law Networks 

This section presents approximation algorithms for the
modularity maximization problem in power-law networks. A
factor-$\rho$ approximation algorithm for a maximization problem finds, in
polynomial time, a solution with value no less than $1/\rho$ times
the value of an optimal solution. Approximation algorithms
are used for problems where exact
algorithms are too expensive, and in many cases they can yield
valuable insights into the problem.

We make a detour to focus on the problem of modularity
maximization when dividing the network into just two
communities. The maximum modularity value of a division into
two communities is shown to be "close" to the best possible
modularity. Thus, an approximation algorithm for the
two-community division problem also yields an approximation
algorithm for the modularity maximization problem.

A. Division into k Communities 

Let $Q_k$ be the maximum modularity obtained by a division of
the network into exactly $k$ communities. We also denote $Q_k^{+} =
\max_{i=1}^{k} Q_i$ and $Q_{opt} = Q_n^{+}$, the best possible modularity over
all possible divisions. Let $\delta^{opt}$ be a community structure with
the maximum modularity $Q_{opt}$.

Proposition 2: $Q_1 = 0$ and $Q_n = -\sum_{i=1}^{n} \frac{d_i^2}{4m^2}$.

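Lemma 1:

$$Q_k^{+} \ge \left(1 - \frac{1}{k}\right) Q_{opt}.$$

Proof: If $\delta^{opt}$ has at most $k$ communities, then we have
$Q_k^{+} = Q_{opt}$. Otherwise, $\delta^{opt}$ has more than $k$ communities.
We can rewrite the modularity as

$$Q_{opt} = \frac{1}{2m} \sum_{i,j} B_{ij} \delta^{opt}_{ij} = \frac{1}{2m} \sum_{\delta^{opt}_{ij} = 1} B_{ij}.$$

Construct a $k$-division of the network by randomly assigning
each community in $\delta^{opt}$ to one of $k$ new "super" communities.
Let $\delta^{k}$ denote the obtained partitioning. If $\delta^{opt}_{ij} = 1$, then
$\delta^{k}_{ij} = 1$, i.e. all intra-community pairs remain within the
new "super" communities. All pairs with $\delta^{opt}_{ij} = 0$
(inter-community pairs) become intra-community pairs with
probability $1/k$. Hence, the contribution of a pair $(i, j)$ with
$\delta^{opt}_{ij} = 0$ to the expected modularity is $\frac{1}{k} \cdot \frac{B_{ij}}{2m}$. Hence, the
expected modularity of the $k$-division obtained by randomly grouping
communities will be

$$Q_e = \frac{1}{2m} \left( \sum_{\delta^{opt}_{ij} = 1} B_{ij} + \frac{1}{k} \sum_{\delta^{opt}_{ij} = 0} B_{ij} \right)
= \frac{1}{2m} \left(1 - \frac{1}{k}\right) \sum_{\delta^{opt}_{ij} = 1} B_{ij} = \left(1 - \frac{1}{k}\right) Q_{opt}.$$

In the second step, we have used the equality $\sum_{i,j} B_{ij} = 0$,
or equivalently $\sum_{\delta^{opt}_{ij} = 0} B_{ij} = -\sum_{\delta^{opt}_{ij} = 1} B_{ij}$. Therefore, we
have $Q_k^{+} \ge Q_e = \left(1 - \frac{1}{k}\right) Q_{opt}$. ■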
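The averaging argument in the proof can be checked numerically on a small instance by enumerating every assignment of communities to $k$ super communities, which is equivalent to a uniform random assignment. The sketch below is illustrative only: the ring of three triangles, the function names, and the choice $k = 2$ are assumptions, and the identity holds for any starting partition, not only an optimal one.

```python
from itertools import product


def modularity(n, edges, labels):
    """Modularity of a partition, computed from the double sum in Eq. (1)."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    total = sum(adj[i][j] - deg[i] * deg[j] / two_m
                for i in range(n) for j in range(n) if labels[i] == labels[j])
    return total / two_m


def expected_merged_modularity(n, edges, labels, k):
    """Average modularity over all k^c ways to map c communities to k groups."""
    comms = sorted(set(labels))
    values = []
    for assign in product(range(k), repeat=len(comms)):
        mapping = dict(zip(comms, assign))
        values.append(modularity(n, edges, [mapping[c] for c in labels]))
    return sum(values) / len(values)


# Hypothetical instance: three triangles connected in a ring.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6), (8, 0)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
q_start = modularity(9, edges, labels)
q_expected = expected_merged_modularity(9, edges, labels, k=2)
print(q_start, q_expected)
```

The average equals $(1 - 1/k)$ times the starting modularity, so at least one of the enumerated $k$-divisions attains that value or better.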
It follows from Lemma 1 that an approximation algorithm
with a factor $\rho$ for maximizing $Q_2^{+}$ will also be an approximation
with a factor $2\rho$ for the modularity maximization problem.
For a division of the network into two groups, define

$$x_i = \begin{cases} 1 & \text{if } i \text{ belongs to community 1,} \\ -1 & \text{if } i \text{ belongs to community 2.} \end{cases}$$

We can write the modularity for the division into two
communities as

$$Q = \frac{1}{4m} \sum_{i,j} B_{ij} (x_i x_j + 1) = \frac{1}{4m} \sum_{i,j} B_{ij} x_i x_j = \frac{1}{4m} x^T B x.$$

Hence, the division into two communities is a special case of
the quadratic programming problem, i.e. the problem of
finding a vector $x \in \{-1, 1\}^n$ such that $x^T B x$ is maximized.
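The identity between the two-community modularity and the quadratic form can be verified numerically. In the sketch below, the function name and the toy graph (two triangles joined by an edge) are assumptions for illustration.

```python
def two_way_modularity(n, edges, x):
    """Return (Q from Eq. (1), x^T B x / (4m)) for a sign vector x in {-1, 1}^n."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    b = [[adj[i][j] - deg[i] * deg[j] / two_m for j in range(n)]
         for i in range(n)]
    # Direct definition: sum B_ij over pairs in the same community, over 2m.
    direct = sum(b[i][j] for i in range(n) for j in range(n)
                 if x[i] == x[j]) / two_m
    # Quadratic form: x^T B x divided by 4m.
    quadratic = sum(x[i] * b[i][j] * x[j]
                    for i in range(n) for j in range(n)) / (2 * two_m)
    return direct, quadratic


edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
direct, quadratic = two_way_modularity(6, edges, [1, 1, 1, -1, -1, -1])
print(direct, quadratic)
```

Both values coincide because $\sum_{i,j} B_{ij} = 0$, which is exactly the step that removes the "+1" term in the derivation above.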



The following results are due to M. Charikar et al. [15] and
Nesterov et al. [20].

Theorem 3 ([15]): Given an arbitrary matrix $A$ whose
diagonal elements are nonnegative, the problem of finding
$x \in \{-1, 1\}^n$ such that $x^T A x$ is maximized can be
approximated within $O(\log n)$. In case $A$ is positive semidefinite,
the ratio can be improved to $2/\pi$ [20].

Unfortunately, the matrix $B$ is not positive definite. Even
worse, the main diagonal contains only negative entries, as the
$i$th entry is $-\frac{d_i^2}{2m}$. Hence, we cannot directly apply the above
results to the problem of dividing into two communities.

B. Power-law Networks

Complex networks, including social, biological, and technological
networks, display a non-trivial topological feature: their
degree sequences can be well approximated by a power-law
distribution [2]. At the same time they exhibit the modular
property, i.e. the existence of a natural division into communities.
We establish the connection between the power-law degree
distribution property and the modular property, stating that
whenever a network has a power-law degree distribution,
communities with a significant modularity are present in the
network.

We use the well-known $P(\alpha, \beta)$ model by F. Chung and L.
Lu [21] for power-law networks, in which there are $y$ vertices
of degree $x$, where $x$ and $y$ satisfy $\log y = \alpha - \beta \log x$. In
other words,

$$|\{v : d(v) = x\}| = y = \frac{e^{\alpha}}{x^{\beta}}.$$



TABLE I: Order and size of network instances 




Fig. 2: (a) Following algorithm. (b) Optimal community structure.
On the left, a community structure found by the Following
Algorithm in Theorem 4 when $d_0 = 2$. Each rounded square
represents a community, and followees are in the darker color.
The modularity is 0.325, i.e. 87% of the optimal modularity,
0.374. On the right, the optimal community structure found
by solving IP_sparse.

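Basically, $\alpha$ is the logarithm of the size of the graph
(on the order of $e^{\alpha}$ nodes) and $\beta$ is the log-log growth rate of the graph.
While the scale of the network depends on $\alpha$, $\beta$ decides the
connection pattern and many other important characteristics
of the network. Different networks at different scales with the
same $\beta$ often exhibit the same characteristics. For instance, the
larger $\beta$, the sparser and the more "power-law" the network
is. Hence, $\beta$ is regarded as a constant in the $P(\alpha, \beta)$ model.

In the $P(\alpha, \beta)$ model, the maximum degree of a $P(\alpha, \beta)$ graph
is $e^{\alpha/\beta}$. The numbers of vertices and edges are

$$n = \begin{cases} \zeta(\beta)\, e^{\alpha} & \text{if } \beta > 1 \\ \alpha e^{\alpha} & \text{if } \beta = 1 \\ \dfrac{e^{\alpha/\beta}}{1 - \beta} & \text{if } \beta < 1 \end{cases}
\qquad
m = \begin{cases} \frac{1}{2} \zeta(\beta - 1)\, e^{\alpha} & \text{if } \beta > 2 \\ \frac{1}{4} \alpha e^{\alpha} & \text{if } \beta = 2 \\ \dfrac{1}{2} \dfrac{e^{2\alpha/\beta}}{2 - \beta} & \text{if } \beta < 2 \end{cases} \tag{11}$$

where $\zeta(t) = \sum_{n=1}^{\infty} \frac{1}{n^t}$ is the Riemann Zeta function. Without
affecting the conclusions, we will simply use real numbers
instead of rounding down to integers. The error terms can
be easily bounded and are sufficiently small in our proofs.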
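The closed forms in (11) for $\beta > 2$ can be compared against the direct sums over degrees. The sketch below follows the convention just stated (real-valued vertex counts, no rounding); the function name and the parameter choice $\alpha = 10$, $\beta = 3$ are assumptions for illustration, and the numerical value of $\zeta(3)$ is hard-coded.

```python
import math


def power_law_counts(alpha, beta):
    """Sum the P(alpha, beta) degree sequence: e^alpha / x^beta vertices of
    degree x, for x = 1 .. floor(e^(alpha / beta))."""
    max_deg = int(math.exp(alpha / beta))
    scale = math.exp(alpha)
    n = sum(scale / x ** beta for x in range(1, max_deg + 1))
    m = 0.5 * sum(x * scale / x ** beta for x in range(1, max_deg + 1))
    return n, m


alpha, beta = 10.0, 3.0
n, m = power_law_counts(alpha, beta)
zeta3 = 1.2020569031595942   # zeta(3), Apery's constant
zeta2 = math.pi ** 2 / 6     # zeta(2)
n_ratio = n / (zeta3 * math.exp(alpha))        # compare with zeta(beta) e^alpha
m_ratio = m / (0.5 * zeta2 * math.exp(alpha))  # compare with zeta(beta-1) e^alpha / 2
print(n_ratio, m_ratio)
```

Both ratios are slightly below one, reflecting the truncation of the zeta sums at the maximum degree; they approach one as $\alpha$ grows.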

Most real-world networks have a log-log growth rate
$\beta$ between 2 and 3. Examples include scientific collaboration
networks with $2.1 < \beta < 2.45$ [22]; the World Wide Web with
$\beta$ of 2.1 and 2.45 for in-degree and out-degree, respectively
[23]; and the Internet at router and intra-domain level with $\beta = 2.48$.
No power-law networks with $\beta < 1$ have been
observed. One of the reasons is that when $\beta < 1$, the number
of edges is $m = \Omega(n^2)$, i.e. the network is not "scale-free".

Problem ID  Name                        Nodes n  Edges m
1           Zachary's karate club       34       78
2           Dolphin's social network    62       159
3           Les Miserables              77       254
4           Books about US politics     105      441
5           American College Football   115      613
6           US Airport 97               332      2126
7           Electronic Circuit (s838)   512      819
8           Scientific Collaboration    1589     2742

Theorem 4: There is an $O(\log n)$ approximation algorithm
for the modularity maximization problem in power-law
networks with log-log growth rate $\beta > 1$. If $\beta > 2$, the
problem can be approximated within a constant approximation
factor $2\zeta(\beta - 1)$, where $\zeta(x) = \sum_{i=1}^{\infty} \frac{1}{i^x}$ is the Riemann Zeta
function.

Proof: From Lemma 1 with $k = 2$, we have $\frac{1}{2} Q_{opt} \le
Q_2^{+}$. Hence, it is sufficient to approximate $Q_2^{+}$ within a factor
of $O(\log n)$.
We have

$$Q_2^{+} = \frac{1}{4m} \max_{x \in \{-1,1\}^n} x^T B x
= \frac{1}{4m} \max_{x \in \{-1,1\}^n} x^T B_0 x - \sum_{i=1}^{n} \frac{d_i^2}{8m^2}, \tag{12}$$

where $B_0$ is obtained by replacing the diagonal of $B$ with
zeros.

Let $D = \sum_{i=1}^{n} \frac{d_i^2}{8m^2}$, the second term in equation (12). We
can approximate

$$\mathrm{OPT}_0 = \frac{1}{4m} \max_{x \in \{-1,1\}^n} x^T B_0 x = Q_2^{+} + D$$

within a factor of $O(\log n)$ by the method in Theorem 3.
That means we can find a division of the network into two
communities with modularity at least

$$\frac{c}{\log n} \mathrm{OPT}_0 - D = \frac{c}{\log n} (Q_2^{+} + D) - D
\ge \frac{c}{\log n} Q_2^{+} - D \ge \frac{c}{2 \log n} Q_{opt} - D,$$

where $c$ is an independent constant.

If we can show that $D = o\left(\frac{1}{\log n} \mathrm{OPT}_0\right)$, then we can
approximate the maximum modularity within a factor $O(\log n)$.
This is equivalent to

$$\lim_{n \to \infty} \frac{Q_{opt}}{D \log n} = \infty \quad \text{or} \quad \lim_{\alpha \to \infty} \frac{Q_{opt}}{D \log n} = \infty. \tag{13}$$

To show (13), we present a linear-time algorithm, called
Following, to find a community structure $C$ with a lower bound
on the modularity. An illustrative example for the algorithm
is shown in Fig. 2a.



Following Algorithm (Parameter $d_0 \in \mathbb{N}^{+}$)

i. Start with all nodes unlabeled.

ii. Sort nodes in non-decreasing order of degree.

iii. For each unlabeled node $v$ with $d_v \le d_0$, find a
neighbor $u$ that is not a follower; set $v$ to follow $u$,
i.e. label $v$ "follower" and $u$ "followee". If many such
$u$ exist, select the one with the minimum degree.

iv. Label all unlabeled nodes "followee".

v. Put each followee and its followers into a community.
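The five steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the edge-list input, the tie-breaking by node index, and the reading of the degree condition in step iii as $d_v \le d_0$ (so that $d_0 = 1$ makes leaves follow their neighbors) are ours. For brevity it uses a comparison sort, so it runs in $O(n \log n + m)$ rather than linear time.

```python
def following(n, edges, d0=1):
    """Sketch of the Following algorithm; returns a list of communities (sets)."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    deg = [len(s) for s in nbrs]
    label = [None] * n          # step i: all nodes unlabeled
    leader = list(range(n))     # followees lead themselves
    for v in sorted(range(n), key=lambda u: deg[u]):  # step ii
        if label[v] is not None or deg[v] > d0:
            continue
        # step iii: candidate neighbors that are not followers
        candidates = [u for u in nbrs[v] if label[u] != "follower"]
        if candidates:
            u = min(candidates, key=lambda w: (deg[w], w))
            label[v], leader[v] = "follower", u
            label[u] = "followee"
    communities = {}
    for v in range(n):
        if label[v] is None:    # step iv
            label[v] = "followee"
        communities.setdefault(leader[v], set()).add(v)  # step v
    return sorted(communities.values(), key=min)


star = [(0, 1), (0, 2), (0, 3), (0, 4)]
path = [(0, 1), (1, 2), (2, 3)]
star_result = following(5, star)
path_result = following(4, path)
print(star_result, path_result)
```

On the star, all four leaves follow the center and form one community; on the 4-node path, each end node follows its inner neighbor, giving two communities.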



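Although higher values of $d_0$ possibly lead to better
approximation ratios, it is sufficient for our proof to
consider only the case $d_0 = 1$. That means all leaf nodes
will attach to (follow) their neighbors. Assume that for a
graph $G = (V, E)$, vertices in $V$ are numbered so that leaf
nodes have higher numbering than non-leaf nodes, i.e.
$V = \{v_1, v_2, \ldots, v_t, v_{t+1}, \ldots, v_n\}$, where $v_{t+1}, \ldots, v_n$ are
the leaf nodes and $t$ is the number of non-leaf nodes. For a
node $v_i$, $i = 1, \ldots, t$, let $k_i \le d_i$ be the
number of leaves attached to $v_i$. There will be $t$ communities
associated with $v_1, v_2, \ldots, v_t$, respectively.

Since there are $e^{\alpha}$ vertices of degree one, there are at least
$\frac{1}{2} e^{\alpha}$ edges inside the considered communities. Hence,

$$Q(C) = \sum_{i=1}^{t} \left( \frac{2 k_i}{2m} - \frac{(k_i + d_i)^2}{4m^2} \right)
\ge \frac{e^{\alpha}}{2m} - \sum_{i=1}^{t} \frac{4 d_i^2}{4m^2}
\ge \frac{e^{\alpha}}{2m} - 8D. \tag{14}$$

Since $Q_{opt} \ge Q(C)$, instead of showing (13), we can show

$$\lim_{\alpha \to \infty} \frac{Q(C)}{D \log n} = \infty \quad \Longleftarrow \quad \lim_{\alpha \to \infty} \frac{e^{\alpha} / 2m}{D \log n} = \infty.$$

From the power-law degree distribution in (11),

$$D = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta}} \cdot \frac{x^2}{8m^2}
= \frac{e^{\alpha}}{8m^2} \sum_{x=1}^{e^{\alpha/\beta}} x^{2 - \beta}. \tag{15}$$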

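Consider all three cases of $\beta$:

Case $\beta > 2$: Since $x^{2-\beta} \le 1$, from equations (11) and (15)
we have

$$Q(C) \ge \frac{e^{\alpha}}{2m} - 8D \ge \frac{1}{\zeta(\beta - 1)} - \frac{4 e^{\alpha/\beta}}{\zeta(\beta - 1)^2 e^{\alpha}} \ge \frac{1}{2 \zeta(\beta - 1)}$$

for sufficiently large $\alpha$. Since $Q_{opt} \le 1$, the community structure
$C$ approximates the optimum solution within a constant factor $2\zeta(\beta - 1)$.

Case $\beta = 2$: We have $\log n < 2\alpha$. Hence, with $m = \frac{1}{4} \alpha e^{\alpha}$,

$$D \log n < 2\alpha \cdot \frac{e^{\alpha} e^{\alpha/2}}{8m^2} = \frac{4}{\alpha e^{\alpha/2}}.$$

Thus,

$$\lim_{\alpha \to \infty} \frac{e^{\alpha}/2m}{D \log n} \ge \lim_{\alpha \to \infty} \frac{e^{\alpha/2}}{2} = \infty.$$

Hence, the modularity maximization problem can be
approximated within a factor $O(\log n)$ in this case.

Case $2 > \beta > 1$: Using $\log n < 2\alpha$ and $m = \frac{1}{2} \frac{e^{2\alpha/\beta}}{2 - \beta}$,

$$D \log n < \frac{2 \alpha e^{\alpha}}{8m^2} \sum_{x=1}^{e^{\alpha/\beta}} x^{2-\beta}
\lesssim \frac{2 \alpha e^{\alpha}}{8m^2} \cdot \frac{e^{(3-\beta)\alpha/\beta}}{3 - \beta}
= \frac{\alpha (2 - \beta)^2}{(3 - \beta)\, e^{\alpha/\beta}},$$

where $\lesssim$ hides lower-order error terms of the kind discussed after (11).
Therefore,

$$\lim_{\alpha \to \infty} \frac{e^{\alpha}/2m}{D \log n}
\ge \lim_{\alpha \to \infty} \frac{e^{\alpha} (2 - \beta)}{e^{2\alpha/\beta}} \cdot \frac{(3 - \beta)\, e^{\alpha/\beta}}{\alpha (2 - \beta)^2}
= \lim_{\alpha \to \infty} \frac{(3 - \beta)\, e^{\alpha (1 - 1/\beta)}}{\alpha (2 - \beta)} = \infty.$$

Hence, the theorem follows. ■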



TABLE II: The modularity obtained by previously published 
methods GN [5], EIG [10], VP [13], LPcomplete, our 
sparse metric approach LPsparse, and the optimal modularity 
values OPT [14]. The optimal modularity for network 8 (as a 
whole) has not been known before; we compute it by solving 
our IPsparse within only 15 seconds. A dash marks an entry 
with no reported value. 

ID   n      GN      EIG     VP      LPcomplete   LPsparse   OPT
1    34     0.401   0.419   0.420   0.420        0.420      0.420
2    62     0.520   -       0.526   0.529        0.529      0.529
3    77     0.540   -       0.560   0.560        0.529      0.529
4    105    -       0.526   0.527   0.527        0.529      0.529
5    115    0.601   -       0.605   0.605        0.605      0.605
6    332    -       -       -       -            0.368      0.368
7    512    -       -       -       -            0.819      0.819
8    1589   -       -       -       -            0.955      0.955



V. Computational experiments 
We present experimental results for our linear programming 
rounding algorithm in Section III. The LP solver is GUROBI 
4.5, running on a PC with an Intel 2.93 GHz processor 
and 12 GB of RAM. We evaluate our algorithm on several 
standard test cases for community structure identification, 
consisting of real-world networks. The dataset names together 
with their sizes are listed in Table I. The largest network 
consists of 1589 vertices and 2742 edges. All references on 
datasets can be found in [13] and [14]. 

TABLE III: Number of constraints in the formulation LPcomplete 
used in [13] (Constraint(C)) and its computational 
time in seconds (Time(C)) versus the number of constraints in 
our sparse metric formulation LPsparse (Constraint(S)) and its 
computational time in seconds (Time(S)). A dash marks an 
instance for which no time is reported. 

ID   n      Constraint(C)    Constraint(S)   Time(C)   Time(S)
1    34     17,952           1,441           0.21      0.02
2    62     113,460          5,743           3.85      0.11
3    77     219,450          6,415           13.43     0.08
4    105    562,380          30,236          60.40     1.76
5    115    740,715          66,452          106.27    13.98
6    332    18,297,018       226,523         -         197.03
7    512    66,716,160       294,020         -         53.18
8    1589   2,002,263,942    159,423         -         2.94



Since the same rounding procedure is applied to the optimal 
fractional solutions, both LPcomplete and LPsparse yield the 
same modularity values. However, LPsparse can run on much 
larger network instances. The modularity values of the rounding 
LP algorithm and other published methods are shown in Table 
II. The rounding LP algorithm finds optimal solutions (or 
solutions within 0.1% of the optimum) in all cases. The source 
code for our LP algorithm can be obtained upon request. 



Finally, we compare the number of constraints of the LP 
formulation used in [13] and our new formulation (LPsparse) 
in Table III. Our new formulation contains substantially fewer 
constraints and thus can be solved more efficiently. The old LP 
formulation cannot be solved within the time allowance (10000 
seconds) and the available memory (12 GB) for network 
instances 6 to 8. The largest instance of 1589 nodes 
is solved surprisingly fast, taking under 3 seconds. The reason 
is the presence of leaves (nodes of degree one) and 
other special motifs that can be efficiently preprocessed with 
the reduction techniques in [24]. 

Our new technique substantially reduces the time and memory 
requirements both theoretically and experimentally without 
any trade-off in the quality of the solution. The size of solved 
network instances rises from hundreds to several thousand 
nodes, while the running times on the medium-size instances are 
sped up by 10 to 150 times. Thus, the sparse metric 
technique is a suitable choice when the network has a moderate 
size and a community structure with performance guarantees 
is desired. 
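
The ratios behind these statements can be recomputed directly from Table III. The snippet below is a small arithmetic sketch; the tuple layout and the function name `ratios` are choices made here for illustration, and `None` marks the instances for which no time is reported for the complete formulation.

```python
# (ID, n, constraints_complete, constraints_sparse, time_complete, time_sparse)
TABLE_III = [
    (1, 34, 17_952, 1_441, 0.21, 0.02),
    (2, 62, 113_460, 5_743, 3.85, 0.11),
    (3, 77, 219_450, 6_415, 13.43, 0.08),
    (4, 105, 562_380, 30_236, 60.40, 1.76),
    (5, 115, 740_715, 66_452, 106.27, 13.98),
    (6, 332, 18_297_018, 226_523, None, 197.03),
    (7, 512, 66_716_160, 294_020, None, 53.18),
    (8, 1589, 2_002_263_942, 159_423, None, 2.94),
]


def ratios(rows):
    """Return (ID, constraint reduction factor, speedup or None) per network."""
    out = []
    for net_id, _n, c_full, c_sparse, t_full, t_sparse in rows:
        constraint_ratio = c_full / c_sparse
        speedup = None if t_full is None else t_full / t_sparse
        out.append((net_id, constraint_ratio, speedup))
    return out


for net_id, c_ratio, speedup in ratios(TABLE_III):
    shown = "n/a" if speedup is None else f"{speedup:.1f}x"
    print(f"network {net_id}: {c_ratio:.1f}x fewer constraints, speedup {shown}")
```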

VI. Discussion 

We have proposed two algorithms for the modularity 
maximization problem in complex networks. Our algorithms 
successfully exploit the sparseness and the power-law degree 
distribution found in many complex networks to provide 
performance guarantees on the solutions. On the one hand, the 
algorithms implied in Theorem 4 are the first approximation 
algorithms for maximizing modularity and hence are of theoretical 
interest. On the other hand, our sparse metric approach is an 
efficient method to find optimal or close-to-optimal community 
structure for networks of up to a few thousand nodes. 

Fortunato and Barthelemy [25] have recently shown that, in 
general, quality functions based on global definitions of community, 
including modularity, have an intrinsic resolution scale, known 
as the resolution limit. Therefore, they fail to detect communities 
smaller than a scale that depends on global attributes of the 
network, such as its total size and the degree of connection 
among communities. However, the resolution limit can be 
overcome by introducing a scaling parameter $\lambda > 0$ into 
the original modularity formula, as independently proposed by 
Arenas et al. [26] and Lambiotte et al. [27]: 

\[
Q_{\lambda} = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \lambda\,\frac{d_i d_j}{2m}\right)\delta_{ij}.
\]



Our proposed methods work naturally with this extension 
with little modification. The only changes in the LP 
formulations are in the objective coefficients; the modularity matrix 
$B$ is replaced with a new "multi-scale" modularity matrix 
$B^{\lambda}$ with $B^{\lambda}_{ij} = A_{ij} - \lambda\frac{d_i d_j}{2m}$. The sparse metric technique 
still applies and provides the same guarantees as solving the 
complete LP formulation. In addition, the constant $\lambda$ does 
not affect the asymptotic approximation ratios of the algorithms 
in Theorem 4. Our ongoing work is to design an efficient 
modularity approximation algorithm that both gives a better 
approximation ratio and performs well in practice. 
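
To illustrate how the scaling parameter enters the objective, the following sketch evaluates the multi-scale modularity $Q_{\lambda} = \frac{1}{2m}\sum_{i,j}(A_{ij} - \lambda d_i d_j / 2m)\delta_{ij}$ for a given partition. It is an illustration under stated assumptions, not the LP method itself: the function name, the dense adjacency-matrix input, and the toy graph of two triangles joined by one edge are chosen here for the example.

```python
def multiscale_modularity(adj_matrix, labels, lam=1.0):
    """Q_lambda = (1/2m) * sum over same-community pairs (i, j) of
    A_ij - lam * d_i * d_j / (2m); lam = 1 gives ordinary modularity."""
    n = len(adj_matrix)
    degree = [sum(row) for row in adj_matrix]
    two_m = sum(degree)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += adj_matrix[i][j] - lam * degree[i] * degree[j] / two_m
    return total / two_m


# Toy graph (an assumption for illustration): two triangles joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

two_triangles = [0, 0, 0, 1, 1, 1]
one_community = [0, 0, 0, 0, 0, 0]

for lam in (0.1, 1.0):
    print(lam,
          multiscale_modularity(A, two_triangles, lam),
          multiscale_modularity(A, one_community, lam))
```

On this toy graph the two-triangle partition scores higher than the single community at $\lambda = 1$, while at $\lambda = 0.1$ the single community scores higher, which shows how $\lambda$ shifts the preferred scale.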